Abstract - With the ever-increasing volume of data stored as
electronic files, there is a continuing need for an application
that can search those files efficiently. This paper extends our
previous sentence searching algorithm by implementing it as a
Windows application. The algorithm has a search time complexity
of O(n) with no preprocessing time, which makes it efficient for
searching sentences in a pool of files.

Index Terms - Text search, sentence searching, searching in
files application

I.  Introduction 

In the 21st century, more and more information is being
documented electronically. The resulting accumulation of files
creates a need for good text searching applications, yet very few
applications can search within files efficiently. The application
presented here uses 'A Fast Sentence Searching Algorithm' to
search for text and sentences in files [8]. Its main purpose is to
search for any sentence in a given pool of files spread across
various folders or drives, so that the desired file can be located
from a sentence or a short paragraph that it contains. Various
text searching algorithms exist, such as KMP and Boyer-Moore,
which can be efficient for general patterns. For sentence
searching, however, the performance of our algorithm is better
than that of the other algorithms, so it has been chosen for the
application [1], [2], [3], and [4].

II.  Related  Work 

Among the text searching algorithms designed so far, the
simplest is the Naive or Brute-Force algorithm. Rabin-Karp is
another searching technique, which makes use of elementary
number-theoretic notions such as the equivalence of two numbers
modulo a third number. Another is the Knuth-Morris-Pratt (KMP)
algorithm, a linear-time string-matching algorithm [5]. This
algorithm uses a prefix function π that encapsulates knowledge
about how the pattern matches against shifts of itself. The most
commonly used text searching algorithm is the Boyer-Moore
algorithm, which can achieve sub-linear searching time [6]. It
uses two functions, a bad-character function and a good-suffix
function, both of which require preprocessing. Let m be the
length of the sentence and let n be the length of the search
space (file). Table I compares the asymptotic running times of
various text searching algorithms. Very little work has been done
in this area. To date, no general-purpose system is available
that can locate a desired file on the basis of the information
available [7], [8].

Table I. Comparative Asymptotic Time Analysis

S. No.  Algorithm                            Preprocessing time     Matching time
1       Naive String Search Algorithm        0 (no preprocessing)   Θ((n-m+1) m)
2       Rabin-Karp string search algorithm   Θ(m)                   average Θ(n+m), worst Θ((n-m+1) m)
3       Finite state automaton based         Θ(m |Σ|)               Θ(n)
4       KMP algorithm                        Θ(m)                   Θ(n)
5       Boyer-Moore search Algorithm         Θ(m + |Σ|)             Ω(n/m), O(n)
6       Our Algorithm                        0 (no preprocessing)   Θ(n)

III.  Algorithm 

The algorithm used in the application to search for a sentence
is as follows:

 1. While (!EndOfFile)
 2.     Do read a single character from file, x
 3.     pos ← pos + 1
 4.     If sentence[i] = x then
 5.         i ← i + 1
 6.     Else
 7.         i ← 0
 8.         If sentence[i] = x then
 9.             i ← i + 1
10.     If i = LengthOfSentence then
11.         c ← c + 1
12.         i ← 0
13. Return c

The above algorithm returns 'c', i.e., the number of times the
sentence being searched for occurs in a single file. It can be
applied to several files, one after another, and thus helps to
distinguish the files that contain a given sentence or paragraph
from those that do not.

The algorithm works by scanning the file character by character
and comparing each character of the file with the characters of
the sentence we wish to find. The algorithm can be viewed in two
phases, described below.

(a) Initially, we compare the first character of the file with
the first character of the sentence to be searched. If there is a
match, we increment i; otherwise, we reset the pointer i to 0 and
compare the current character with the first character of the
sentence.

(b) We then check whether the value of 'i' is equal to the length
of the sentence. The value of 'i' equals the length of the
sentence only if the complete sentence has been found, in which
case we increment the value of 'c' and reset 'i' to 0.
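
The two phases above can be sketched in a few lines of Python. This is only an illustrative transcription of the numbered pseudocode, not the C#.NET code of the application. The function name count_sentence is hypothetical, the counters i and c are assumed to start at 0, the text is passed as an in-memory string rather than read from a file, and the unused position counter pos is omitted.

```python
def count_sentence(text, sentence):
    """Count occurrences of `sentence` in `text` with a single left-to-right scan."""
    if not sentence:
        return 0  # assumption: an empty sentence is treated as "not found"
    i = 0  # index of the next sentence character to match (assumed to start at 0)
    c = 0  # number of complete matches found so far (assumed to start at 0)
    for x in text:               # lines 1-2: read one character at a time
        if sentence[i] == x:     # lines 4-5: the current character matches
            i += 1
        else:                    # lines 6-9: restart and retry the first character
            i = 0
            if sentence[i] == x:
                i += 1
        if i == len(sentence):   # lines 10-12: a complete sentence was matched
            c += 1
            i = 0
    return c                     # line 13


print(count_sentence("the cat sat. the cat ran.", "the cat"))
```

One property of this sketch is worth noting: because i is reset to 0 on a mismatch, a match that begins inside a partially matched prefix is not detected. For example, the sketch returns 0 for the sentence "aab" in the text "aaab".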

IV.  Theoretical  Analysis 

Considering the algorithm above, the overall complexity of
searching for a sentence in a file is Θ(n), with no preprocessing
time, where n is the number of characters in the file.

Lines 1-12 form a loop that continues until the end of the file,
i.e., it iterates 'n' (number of characters in the file) times.
Line 2 reads a single character at a time and thus has O(1)
complexity. Similarly, line 3 executes once per iteration. Lines
4-9 check whether the character read from the file matches the
current character of the sentence, and the respective lines
execute accordingly. If we find a mismatch, we compare the
character with the first character of the sentence we are
searching for. Lines 10-12 check whether the sentence has been
found in the file and, if so, increment the count of occurrences
by 1. Finally, line 13 returns the number of times the sentence
is present in the file. This shows that there is a single loop
iterating 'n' times, with a constant amount of work per
iteration. Thus, the complexity of the algorithm is Θ(n) under
all circumstances, since the loop continues to the last character
of the file whether or not the sentence is present in the file
[5], [6], [7], and [8].
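
The constant work per iteration can be checked empirically. The following Python sketch instruments the scan with a comparison counter. It is an illustration under stated assumptions rather than the authors' implementation: the function name count_with_comparisons is hypothetical, and the sentence is assumed to be non-empty. Each character read triggers one comparison, plus one more on a mismatch, so the total lies between n and 2n.

```python
def count_with_comparisons(text, sentence):
    """Return (matches, comparisons) for the single-pass scan.

    Assumes `sentence` is non-empty.
    """
    i = matches = comparisons = 0
    for x in text:
        comparisons += 1          # comparison against sentence[i]
        if sentence[i] == x:
            i += 1
        else:
            i = 0
            comparisons += 1      # second comparison, against sentence[0]
            if sentence[i] == x:
                i += 1
        if i == len(sentence):
            matches += 1
            i = 0
    return matches, comparisons


text = "abcabcabd" * 100
matches, comparisons = count_with_comparisons(text, "abd")
n = len(text)
print(matches, n <= comparisons <= 2 * n)
```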

V. Implementation

The algorithm has been implemented in C#.NET using Visual Studio
as the IDE. The application can search either a single file or a
complete folder, allowing the user to quickly search inside files
on a local drive or on the network. It can retrieve the documents
that contain the sentences and phrases of interest. Figure 1
shows a snapshot of the application at work. The files containing
the sentence are displayed in a list, and clicking an entry opens
the corresponding file. The search can be performed on PDF, DOC,
TXT, HTML, and PPT files. Some of the additional features
included in our application are:

Normal  Searching: 

Normal Searching allows the use of the question mark (?) to match
exactly one character and the asterisk (*) to match one or more
characters. All white space characters are treated alike, and a
run of multiple white space characters is treated as one.
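
One way to realize these matching rules is to translate the query into a regular expression. The Python sketch below is an assumption about how such a translation could look, not a description of the application's C#.NET code. The function name wildcard_to_regex is hypothetical, and '?' and '*' are assumed to match any character other than a newline.

```python
import re


def wildcard_to_regex(query):
    """Translate a Normal Searching query into a compiled regular expression."""
    parts = []
    # Split while keeping the separators: runs of white space, '?' and '*'.
    for token in re.split(r"(\s+|\?|\*)", query.strip()):
        if not token:
            continue
        if token == "?":
            parts.append(".")       # exactly one character
        elif token == "*":
            parts.append(".+")      # one or more characters
        elif token.isspace():
            parts.append(r"\s+")    # any run of white space counts as one
        else:
            parts.append(re.escape(token))  # literal text
    return re.compile("".join(parts))


pattern = wildcard_to_regex("fast  sentence s?arching")
print(bool(pattern.search("A fast\n sentence searching algorithm")))
```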

Search  a  drive,  path  or  multiple  drives  and  paths: 

Such as C:\ | \\Corp-backup\C\Accounting

Exclude  specific  folders  or  paths: 

C:\ | -Windows | -Program Files

This option would exclude the folders Windows and Program Files
and all their subfolders.
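
A path specification of this form could be handled as sketched below in Python. This is only an illustration under stated assumptions. The helper names parse_path_spec and iter_files are hypothetical, entries are assumed to be separated by '|', a leading '-' is assumed to mark an excluded folder name, and folder names are compared case-insensitively.

```python
import os


def parse_path_spec(spec):
    """Split a '|'-separated spec into search roots and excluded folder names."""
    roots, excluded = [], set()
    for part in spec.split("|"):
        part = part.strip()
        if not part:
            continue
        if part.startswith("-"):
            # assumption: folder names are compared case-insensitively
            excluded.add(part[1:].strip().lower())
        else:
            roots.append(part)
    return roots, excluded


def iter_files(roots, excluded):
    """Yield every file under the roots, skipping excluded folders entirely."""
    for root in roots:
        for dirpath, dirnames, filenames in os.walk(root):
            # Prune in place so os.walk never descends into excluded folders
            # or any of their subfolders.
            dirnames[:] = [d for d in dirnames if d.lower() not in excluded]
            for name in filenames:
                yield os.path.join(dirpath, name)


roots, excluded = parse_path_spec(r"C:\ | -Windows | -Program Files")
print(roots, sorted(excluded))
```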

Restrict  to  specific  file  types  and  patterns: 

*.doc | *.rtf

This would check only files with names ending in .doc or .rtf.

Exclude  specific  extensions: 

-*.bak | -*.tmp | -~*

This option would search all files except those that have the
extension bak or tmp, or whose names start with the tilde
character.
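
The include and exclude patterns for file names can be illustrated with Python's standard fnmatch module. This is a hedged sketch, not the application's actual filtering code. The helper names parse_file_spec and file_selected are hypothetical, patterns are assumed to be separated by '|' with a leading '-' marking an exclusion, and matching is assumed to be case-insensitive.

```python
from fnmatch import fnmatch


def parse_file_spec(spec):
    """Split a '|'-separated spec into include and exclude patterns."""
    include, exclude = [], []
    for part in spec.split("|"):
        part = part.strip().lower()   # assumption: case-insensitive matching
        if not part:
            continue
        if part.startswith("-"):
            exclude.append(part[1:])
        else:
            include.append(part)
    return include, exclude


def file_selected(name, include, exclude):
    """Return True if the file name passes the include and exclude patterns."""
    name = name.lower()
    if any(fnmatch(name, pattern) for pattern in exclude):
        return False
    # With no include patterns, every file that is not excluded is searched.
    return not include or any(fnmatch(name, pattern) for pattern in include)


include, exclude = parse_file_spec("-*.bak | -*.tmp | -~*")
for name in ["report.doc", "old.bak", "~lock.doc"]:
    print(name, file_selected(name, include, exclude))
```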


Fig. 1. Application Screenshot

VI. Conclusion and Future Scope

With the increasing use of computers for documenting almost
everything, we need applications that help in searching those
documents. This application can be very useful, as very few
applications serve this purpose. Moreover, since it uses an
efficient searching algorithm, results are computed quickly,
saving the user considerable time.

We expect the application to substantially change the way
various offices in different organizations work. It provides a
user-friendly way to find the desired file or files from the
little information available from various media. People are
sometimes unable to find the files they want because the number
of files becomes very large and the files are spread across
various folders. A natural language interface to the system may
be developed to make it more user-friendly in the offices of
various organizations. We plan to integrate the search engine
with an English language interface. Later, the search engine may
be extended to Hindi language files, along with a Hindi language
interface.